Advanced models

1 Getting started

1.1 Required packages

library(highcharter) # For interactive plots
library(WDI)         # For World Bank data
library(openxlsx)    # A cool package to deal with Excel files/formats
library(quantmod)    # Package for financial data extraction
library(tsibble)     # TS with dataframes framework 
library(fable)       # Package for time-series models & predictions
library(feasts)      # Package for time-series analysis
library(forecast)    # Another package for TS (forecasting)
library(tseries)     # Simple package for time-series tests
library(plotly)      # For 3D graphs
library(mvtnorm)     # For multivariate Gaussian distributions
library(crypto2)     # For crypto data
library(ggsci)       # For color palettes
library(rugarch)     # For GARCH models
library(fGarch)      # For GARCH models
library(vars)        # For VAR (Vector AutoRegressive) models
library(tidyverse)   # THE library for data science; at the end to override other functions
set.seed(42)         # Random seed

impute <- function(v, n = 6){     # Imputation: forward-fill NAs (up to n passes)
  for(j in 1:n){
    ind <- which(is.na(v))
    if(length(ind)>0){
      if(ind[1]==1){ind <- ind[-1]}  # A leading NA has no predecessor: skip it
      v[ind] <- v[ind-1]             # Carry the previous value forward
    }
  }
  return(v)
}

1.2 Fetching some (crypto) data

coins <- crypto_list() 
c_info <- crypto_info(coin_list = coins, limit = 30)
❯ Scraping crypto info
❯ Processing crypto info
coin_names <- c("Bitcoin", "Ethereum", "Tether USDt", "BNB", "USDC", "Solana")
coin_hist <- crypto_history(coins |> dplyr::filter(name %in% coin_names),
                            start_date = "20170101",
                            end_date = "20251206")
❯ Scraping historical crypto data
❯ Processing historical crypto data

Coin USDC does not have data available! Cont to next coin.

Coin Solana does not have data available! Cont to next coin.
coin_hist <- coin_hist |>  # Timestamps are at midnight, hence we add a day.
  dplyr::mutate(date = 1 + as.Date(as.POSIXct(timestamp, origin="1970-01-01")))
coin_hist <- coin_hist |> 
  group_by(name) |>
  mutate(return = close / lag(close) - 1 ) |>
  ungroup()

2 ARCH-type models

2.1 Realized volatility

Then, let’s have a look at volatility. It’s measured as

\[v_t=\sqrt{\frac{1}{T-1}\sum_{s=1}^T(r_{t-s}-\bar{r})^2},\] where \(\bar{r}\) is the mean return in the sample of \(T\) observations. Below, we use an optimized (C++-based) function to compute it as a rolling standard deviation over 20 days.
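
For intuition, here is a minimal base-R sketch of what a trailing rolling standard deviation computes (RcppRoll::roll_sd does the same in C++, with its own alignment conventions); the series `r` below is simulated, not the crypto sample:

```r
set.seed(42)
r <- rnorm(100, sd = 0.02)                 # simulated daily returns
roll_sd_base <- function(x, n = 20) {
  out <- rep(NA_real_, length(x))          # the first (n - 1) values stay NA
  for (t in n:length(x)) {
    out[t] <- sd(x[(t - n + 1):t])         # sd over the trailing n-day window
  }
  out
}
v <- roll_sd_base(r, 20)
```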

library(RcppRoll)
coin_hist <- coin_hist |>
  group_by(name) |>
  na.omit() |>
  mutate(real_vol_20 = roll_sd(return, n = 20, fill = NA, na.rm = T))
coin_hist |>
  filter(name == "Bitcoin") |>
  pivot_longer(all_of(c("close", "real_vol_20")), names_to = "series", values_to = "value") |>
  ggplot(aes(x = date, y = value, color = series)) + geom_line() + theme_bw() +
  facet_wrap(vars(series), nrow = 2, scales = "free")# + scale_y_log10()
Warning: Removed 19 rows containing missing values or values outside the scale range
(`geom_line()`).

Now let’s have a look at the autocorrelation of the volatility: it’s highly persistent.

crypto_acf <- coin_hist |> 
  na.omit() %>%
  split(.$name) %>% 
  map(~acf(.$real_vol_20, plot = F)) %>% 
  map_dfr(
    ~data.frame(lag = .$lag, 
                acf = .$acf,
                ci = qnorm(0.975)/sqrt(.$n.used)
    ), .id = "name") 
crypto_acf |> 
  filter(lag > 0, lag < 15) |>
  ggplot(aes(x = lag, y = acf, fill = name)) + 
  geom_col(position = "dodge", alpha = 0.8) + theme_bw() +
  scale_fill_viridis_d(option = "magma", end = 0.9, begin = 0.1)

Hence, in practice, simple autoregressive models can be too limited because some variables are not exactly stationary.

The realized volatility is based on past returns. But markets (equity mostly) have another very cool indicator called the VIX. The VIX is different because it is forward-looking: it is computed from option prices (via their implied volatilities over specific horizons). Still, it is empirically linked to the traditional realized volatility.

Let’s have a look below.

min_date <- "1998-01-01"                 # Starting date
max_date <- "2025-11-07"                 # Ending date
getSymbols("^VIX", src = 'yahoo',        # The data comes from Yahoo Finance
           from = min_date,              # Start date
           to = max_date,                # End date
           auto.assign = TRUE, 
           warnings = FALSE)
[1] "VIX"
VIX <- VIX |> as.data.frame() |>
  rownames_to_column("date") |>
  dplyr::mutate(date = as.Date(date))
head(VIX,7)                                   # Have a look at the result!
        date VIX.Open VIX.High VIX.Low VIX.Close VIX.Volume VIX.Adjusted
1 1998-01-02    24.34    24.93   23.42     23.42          0        23.42
2 1998-01-05    24.11    25.02   23.02     24.36          0        24.36
3 1998-01-06    25.20    25.97   24.87     25.66          0        25.66
4 1998-01-07    26.15    27.43   25.07     25.07          0        25.07
5 1998-01-08    26.24    26.70   25.62     26.01          0        26.01
6 1998-01-09    25.79    29.35   25.69     28.69          0        28.69
7 1998-01-12    28.69    31.08   28.02     28.02          0        28.02
vix_chart <- VIX |> dplyr::select(VIX.Close)
rownames(vix_chart) <- VIX |> dplyr::pull(date)
highchart(type = "stock") %>%
    hc_title(text = "Evolution of the VIX") %>%
    hc_add_series(as.xts(vix_chart)) %>%
    hc_tooltip(pointFormat = '{point.y:.3f}') 

We clearly see that the series exhibits some moments of extreme activity and its distribution is not the same through time. This is referred to as “clusters of volatility”.

2.2 Some theory

Therefore, we also need to introduce models that allow for this type of feature. In 1982, Robert Engle proposed such a model, called ARCH for “AutoRegressive Conditional Heteroskedasticity”, which was generalized in 1986 to GARCH models. Engle obtained the Nobel Prize in economics in part for this contribution.

As we show below, GARCH models have a direct link with auto-regressive processes!

The idea is to work with the innovations \(e_t\) and to specify them a bit differently, namely as \(e_t=\sigma_t z_t\), where \(\sigma_t\) is a time-varying standard deviation and \(z_t\) is a simple white noise. What matters is that the volatility term is modelled as \[\sigma^2_t = a+\sum_{j=1}^pd_j\sigma^2_{t-j} + \sum_{k=1}^qb_ke^2_{t-k},\] hence the value of \(\sigma_t^2\) depends both on its own past and on the past of squared innovations.
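
To see the clustering mechanism at work, a small simulation of the \(p=q=1\) case (the parameter values below are arbitrary, chosen only so that \(b_1+d_1<1\)):

```r
set.seed(42)
n <- 2000
a <- 1e-5; b1 <- 0.1; d1 <- 0.85           # a = constant, d1 on sigma2, b1 on e^2
sigma2 <- e <- numeric(n)
sigma2[1] <- a / (1 - b1 - d1)             # start at the unconditional variance
e[1] <- sqrt(sigma2[1]) * rnorm(1)
for (t in 2:n) {
  sigma2[t] <- a + d1 * sigma2[t - 1] + b1 * e[t - 1]^2  # the variance recursion
  e[t] <- sqrt(sigma2[t]) * rnorm(1)       # e_t = sigma_t * z_t, z_t white noise
}
```

Plotting `e` shows calm periods interrupted by bursts of large (positive and negative) shocks: clusters of volatility.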

There have been MANY extensions to these models: T-GARCH, E-GARCH, etc. See for instance GARCH models: structure, statistical inference and financial applications.

2.3 Examples: estimation

There are many packages for GARCH estimation. We present two: {rugarch} and {fGarch}.

In {fGarch}, the model is:

\[\sigma_t^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2.\] In the output, some additional parameters allow for a more general model:

  • shape is the shape parameter of the conditional distribution.
  • skew is the skewness parameter of the conditional distribution.
  • delta is the exponent \(\delta\) of the variance recursion, usually fixed at 2.

\(\delta\) values other than 2 come from P-GARCH models:

\[\begin{align} \varepsilon_t & = \sigma_t z_t, \qquad z_t \overset{\text{i.i.d.}}{\sim} \text{Skew}(\text{shape},\text{skew}), \\ \sigma_t^{\delta} &= \omega + \sum_{i=1}^{q} \alpha_i\bigl(|\varepsilon_{t-i}| - \gamma_i \varepsilon_{t-i}\bigr)^{\delta} + \sum_{j=1}^{p} \beta_j\,\sigma_{t-j}^{\delta}, \end{align}\]

The distribution is the Skewed Generalized Error Distribution, which allows for more flexibility than the simple Gaussian law. It nests many simpler distributions, like the Cauchy, Laplace, GED and Student laws for instance.

fit_f <- garchFit(formula = ~garch(1,1), 
         data = coin_hist |> 
           filter(name == "Bitcoin") |> 
           pull(return), 
         trace = F) 
fit_f

Title:
 GARCH Modelling 

Call:
 garchFit(formula = ~garch(1, 1), data = pull(filter(coin_hist, 
    name == "Bitcoin"), return), trace = F) 

Mean and Variance Equation:
 data ~ garch(1, 1)
<environment: 0x147db6828>
 [data = pull(filter(coin_hist, name == "Bitcoin"), return)]

Conditional Distribution:
 norm 

Coefficient(s):
        mu       omega      alpha1       beta1  
1.2790e-03  8.8107e-05  1.0231e-01  8.3473e-01  

Std. Errors:
 based on Hessian 

Error Analysis:
        Estimate  Std. Error  t value Pr(>|t|)    
mu     1.279e-03   5.246e-04    2.438   0.0148 *  
omega  8.811e-05   1.087e-05    8.108 4.44e-16 ***
alpha1 1.023e-01   1.184e-02    8.643  < 2e-16 ***
beta1  8.347e-01   1.577e-02   52.934  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Log Likelihood:
 7086.565    normalized:  1.972874 

Description:
 Sun Nov 30 19:04:14 2025 by user:  

Now, with {rugarch}:

spec <- ugarchspec()
fit_r <- ugarchfit(data = coin_hist |> 
                     filter(name == "Bitcoin") |> 
                     na.omit() |>
                     pull(return), 
                   spec = spec)
fit_r@fit$coef
           mu           ar1           ma1         omega        alpha1 
 1.271558e-03 -4.037899e-01  3.681241e-01  8.677615e-05  1.002666e-01 
        beta1 
 8.370735e-01 

The results are not exactly the same! This is because the default `ugarchspec()` adds an ARMA(1,1) model for the conditional mean: ar1 and ma1 are mean-equation coefficients. The variance recursion is the same as before, \[\sigma_t^2 = \omega + \alpha e_{t-1}^2 + \beta \sigma_{t-1}^2,\] and indeed omega, alpha1 (\(\alpha\)) and beta1 (\(\beta\)) are close to the {fGarch} estimates.
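
To make the two fits directly comparable, one can switch off the ARMA part of the {rugarch} mean equation (a spec sketch, not run here; the data argument is left to the reader):

```r
spec_comp <- ugarchspec(
  mean.model     = list(armaOrder = c(0, 0), include.mean = TRUE),  # constant mean only
  variance.model = list(model = "sGARCH", garchOrder = c(1, 1)),
  distribution.model = "norm"
)
# fit_comp <- ugarchfit(data = ..., spec = spec_comp)
```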

With both libraries, it is possible to make predictions.

predict(fit_f, n.ahead = 5)
  meanForecast  meanError standardDeviation
1  0.001279035 0.02777110        0.02777110
2  0.001279035 0.02847428        0.02847428
3  0.001279035 0.02911777        0.02911777
4  0.001279035 0.02970811        0.02970811
5  0.001279035 0.03025082        0.03025082
ugarchforecast(fit_r, n.ahead = 3)

*------------------------------------*
*       GARCH Model Forecast         *
*------------------------------------*
Model: sGARCH
Horizon: 3
Roll Steps: 0
Out of Sample: 0

0-roll forecast [T0=1979-10-14]:
       Series   Sigma
T+1 0.0020578 0.02914
T+2 0.0009541 0.02971
T+3 0.0013998 0.03024

The values are quite close!

A cool feature of {rugarch} is that you can use forward rolling, as in actual backtests (more on that below).

3 Vector Auto-Regression

3.1 The VAR(1) model

Now we are switching to full multivariate mode with the Vector Auto-Regression (VAR). It is written as:

\[X_t = \mu + A X_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \text{i.i.d. } (0,\Sigma_\varepsilon),\] with \(X_t \in \mathbb{R}^k, \mu \in \mathbb{R}^k, A \in \mathbb{R}^{k \times k}\).

Interpretation

In two dimensions: \(X_t=[X_{t,1} \ X_{t,2} ]'\).

\[X_t = \mu + \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix} X_{t-1} + \text{error}\]

So \(A_{11}\) measures the impact of \(X_{t-1,1}\) on \(X_{t,1}\) (itself), and similarly \(A_{22}\) for the second component. The off-diagonal terms drive the cross-effects: \(A_{12}\) is the impact of \(X_{t-1,2}\) on \(X_{t,1}\), and vice-versa.

A few probabilistic properties.

Mean

If the VAR(1) is stable (all eigenvalues of \(A\) satisfy \(|\lambda_i(A)|<1\)), the unconditional mean exists and is given by \[\mathbb{E}[X_t] = (I - A)^{-1}\mu.\]

Next, let \(\Gamma_0 = \mathbb{V}(X_t)\); it must satisfy \[ \Gamma_0 = A \Gamma_0 A' + \Sigma_\varepsilon.\]

Next, define the lag-\(h\) autocovariance matrix: \[\Gamma_h = \text{Cov}(X_t, X_{t-h}) = \mathbb{E}[X_t X_{t-h}'].\]

Using for simplicity \(X_t = A X_{t-1} + \varepsilon_t\) (the constant terms do not affect the covariances): \[ \Gamma_h = \text{Cov}(A X_{t-1} + \varepsilon_t, X_{t-h}) = A \text{Cov}(X_{t-1}, X_{t-h}) + \underbrace{\text{Cov}(\varepsilon_t, X_{t-h})}_{=0} = A \Gamma_{h-1} \]

Autocovariance

By recursion, we obtain the multivariate version of the AR(1): \[\Gamma_h = A^h \Gamma_0, \quad h \ge 0.\] Scaling by the variance, we get the autocorrelation: \(A^h\)!
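
These formulas are easy to check numerically on a toy, stable 2-dimensional system (all values below are arbitrary):

```r
A     <- matrix(c(0.5, 0.1, 0.2, 0.3), 2, 2)   # stable: eigenvalues inside the unit circle
mu    <- c(1, 2)
Sigma <- diag(2)
mean_X <- solve(diag(2) - A, mu)               # E[X] = (I - A)^{-1} mu
G0 <- Sigma                                    # iterate Gamma_0 = A Gamma_0 A' + Sigma
for (i in 1:200) G0 <- A %*% G0 %*% t(A) + Sigma
G1 <- A %*% G0                                 # Gamma_1 = A Gamma_0
```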

3.2 The VAR(p) Model

A Vector Autoregression of order \(p\), denoted VAR(\(p\)), models a \(k\)-dimensional time series vector \(\mathbf{X}_t = (X_{1,t}, \dots, X_{k,t})^\top\) as:

\[ \mathbf{X}_t = \mathbf{c} + A_1 \mathbf{X}_{t-1} + A_2 \mathbf{X}_{t-2} + \cdots + A_p \mathbf{X}_{t-p} + \mathbf{e}_t, \]

where:

  • \(\mathbf{X}_t\) is a \(k \times 1\) vector of endogenous variables,
  • \(\mathbf{c}\) is a \(k \times 1\) vector of intercept terms,
  • \(A_i\) are \(k \times k\) coefficient matrices,
  • \(\mathbf{e}_t\) is a \(k \times 1\) vector of innovations (reduced-form errors).

Error term assumptions: the innovations satisfy

\[\mathbf{e}_t \sim (0, \Sigma_e),\]

where \(\Sigma_e\) is a symmetric positive definite \(k \times k\) covariance matrix, allowing contemporaneous correlation across equations.
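
The VAR(\(p\)) can be stacked into a VAR(1) in companion form, which is how the VAR(1) properties above (in particular the stability condition) extend to \(p>1\): writing \(\mathbf{Z}_t = (\mathbf{X}_t^\top, \dots, \mathbf{X}_{t-p+1}^\top)^\top\),

\[\mathbf{Z}_t = \begin{bmatrix} \mathbf{c} \\ \mathbf{0} \\ \vdots \\ \mathbf{0} \end{bmatrix} + \begin{bmatrix} A_1 & A_2 & \cdots & A_{p-1} & A_p \\ I_k & 0 & \cdots & 0 & 0 \\ 0 & I_k & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & I_k & 0 \end{bmatrix} \mathbf{Z}_{t-1} + \begin{bmatrix} \mathbf{e}_t \\ \mathbf{0} \\ \vdots \\ \mathbf{0} \end{bmatrix},\]

and the model is stable when all eigenvalues of the companion matrix lie inside the unit circle.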

3.3 Empirics

VARs are often used on macroeconomic data. Below, we fetch a sample from the World Bank API.

wb_raw <- WDI(                              # World Bank data
  indicator = c(
    "labor" = "SL.TLF.TOTL.IN",             # Labor force (# individuals)
    "savings_rate" = "NY.GDS.TOTL.ZS",      # Savings rate (% GDP)
    "inflation" = "FP.CPI.TOTL.ZG",         # Inflation rate
    "trade" = "NE.TRD.GNFS.ZS",             # Trade as % of GDP 
    "pop" = "SP.POP.TOTL",                  # Population
    "pop_growth" = "SP.POP.GROW",           # Population growth
    "capital_formation" = "NE.GDI.TOTL.ZS", # Gross capital formation (% GDP)
    "gdp_percap" = "NY.GDP.PCAP.CD",        # GDP per capita
    "RD_percap" = "GB.XPD.RSDV.GD.ZS",      # R&D per capita
    "educ_level" = "SE.SEC.CUAT.LO.ZS",     # % pop reachiing second. educ. level
    "educ_spending" = "SE.XPD.TOTL.GD.ZS",  # Education spending (%GDP)
    "nb_researchers" = "SP.POP.SCIE.RD.P6", # Nb researchers per million inhab.
    "debt" = "GC.DOD.TOTL.GD.ZS",           # Central gov. debt (% of GDP)
    "gdp" = "NY.GDP.MKTP.CD"                # Gross Domestic Product (GDP)
  ), 
  extra = TRUE,
  start = 1960,
  end = 2024) |>
  mutate(across(everything(), as.vector)) |>
  select(-status, -lending, -iso2c, -iso3c) |>  
  filter(region != "Aggregates", income != "Aggregates") |>
  arrange(country, year) |>
  group_by(country) |>
  mutate(across(everything(), impute)) |>
  mutate(gdp_growth = gdp_percap/dplyr::lag(gdp_percap) - 1, .before = "region") |>   
  ungroup() |>
  filter(lastupdated == max(lastupdated)) |>
  arrange(country, year) 

Ok, let’s focus on the US for simplicity; taking other countries into account would require a panel VAR.
To lighten the (rather heavy) output, we focus on only 3 variables.

wb_us <- wb_raw |> 
  filter(country == "United States") |>
  select(gdp_growth, inflation, pop_growth) |>
  na.omit()
fit_VAR <- VAR(wb_us, p = 1) 
fit_VAR |> summary()

VAR Estimation Results:
========================= 
Endogenous variables: gdp_growth, inflation, pop_growth 
Deterministic variables: const 
Sample size: 63 
Log Likelihood: 97.776 
Roots of the characteristic polynomial:
0.845  0.75 0.2559
Call:
VAR(y = wb_us, p = 1)


Estimation results for equation gdp_growth: 
=========================================== 
gdp_growth = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const 

              Estimate Std. Error t value Pr(>|t|)   
gdp_growth.l1 0.422576   0.141282   2.991  0.00405 **
inflation.l1  0.001643   0.001513   1.086  0.28192   
pop_growth.l1 0.004779   0.012570   0.380  0.70517   
const         0.020769   0.014248   1.458  0.15025   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.0253 on 59 degrees of freedom
Multiple R-Squared: 0.2914, Adjusted R-squared: 0.2554 
F-statistic: 8.087 on 3 and 59 DF,  p-value: 0.0001344 


Estimation results for equation inflation: 
========================================== 
inflation = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const 

              Estimate Std. Error t value Pr(>|t|)    
gdp_growth.l1 30.60299    8.81182   3.473  0.00097 ***
inflation.l1   0.56629    0.09439   6.000 1.29e-07 ***
pop_growth.l1 -0.83768    0.78398  -1.068  0.28965    
const          0.84518    0.88868   0.951  0.34546    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 1.578 on 59 degrees of freedom
Multiple R-Squared: 0.6864, Adjusted R-squared: 0.6705 
F-statistic: 43.05 on 3 and 59 DF,  p-value: 7.201e-15 


Estimation results for equation pop_growth: 
=========================================== 
pop_growth = gdp_growth.l1 + inflation.l1 + pop_growth.l1 + const 

              Estimate Std. Error t value Pr(>|t|)    
gdp_growth.l1 0.838545   0.535011   1.567    0.122    
inflation.l1  0.005373   0.005731   0.938    0.352    
pop_growth.l1 0.862075   0.047600  18.111   <2e-16 ***
const         0.059493   0.053956   1.103    0.275    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


Residual standard error: 0.09582 on 59 degrees of freedom
Multiple R-Squared: 0.8503, Adjusted R-squared: 0.8427 
F-statistic: 111.7 on 3 and 59 DF,  p-value: < 2.2e-16 



Covariance matrix of residuals:
           gdp_growth inflation pop_growth
gdp_growth  0.0006403   0.01862 -0.0004703
inflation   0.0186233   2.49093 -0.0238060
pop_growth -0.0004703  -0.02381  0.0091824

Correlation matrix of residuals:
           gdp_growth inflation pop_growth
gdp_growth     1.0000    0.4663    -0.1939
inflation      0.4663    1.0000    -0.1574
pop_growth    -0.1939   -0.1574     1.0000

Then, onto forecasting!

predict(fit_VAR, n.ahead = 3)
$gdp_growth
           fcst        lower      upper         CI
[1,] 0.04827954 -0.001316983 0.09787606 0.04959652
[2,] 0.05065569 -0.004265562 0.10557695 0.05492126
[3,] 0.05195311 -0.004846277 0.10875250 0.05679939

$inflation
         fcst       lower    upper       CI
[1,] 3.000879 -0.09247298 6.094230 3.093352
[2,] 3.223890 -0.97430754 7.422087 4.198197
[3,] 3.435725 -1.34243374 8.213883 4.778159

$pop_growth
          fcst     lower    upper        CI
[1,] 0.9528018 0.7649883 1.140615 0.1878135
[2,] 0.9374872 0.6911543 1.183820 0.2463330
[3,] 0.9274756 0.6415400 1.213411 0.2859356

Another option is the {dynlm} package; see its documentation and tutorials for VAR models.

4 A primer on prediction

4.1 Training/testing & co

The key idea is backtesting: you have to put yourself in the shoes of someone living in the past, that is, without knowledge of the future. To do this, you need to consider a loop that spans different dates in the past. At each point in time \(t\) you:

  1. extract the data available at time \(t\);
  2. build a predictive model;
  3. make some prediction based on the currently available data;
  4. evaluate the accuracy of the prediction thanks to the (future) realized values.

A simplified diagram below.

4.2 Some code
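
A minimal sketch of the loop above, with an expanding window and one-step-ahead forecasts. A simulated AR(1) series and a simple arima() model stand in for whatever data and model you actually use:

```r
set.seed(42)
y <- arima.sim(model = list(ar = 0.5), n = 300)   # toy series standing in for real data
oos_start <- 200                                  # first out-of-sample date
pred <- real <- c()
for (t in oos_start:(length(y) - 1)) {
  fit  <- arima(y[1:t], order = c(1, 0, 0))        # 1.-2. data up to t, fit the model
  pred <- c(pred, predict(fit, n.ahead = 1)$pred)  # 3. predict t + 1
  real <- c(real, y[t + 1])                        # 4. the realized value
}
mse <- mean((real - pred)^2)                       # out-of-sample accuracy
```

Refitting at every date is costly; in practice one often re-estimates less frequently (e.g., monthly) while still predicting every period.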

4.3 What really matters (in finance)

Models (and people) often focus on loss (objective) functions.
In finance, we usually try to forecast returns, hence the loss function is the (root-) Mean Squared Error:

\[\text{MSE} = \frac{1}{T} \sum_{t=1}^T (r_t-\tilde{r}_t)^2, \]

where \(r_t\) is the realized return and \(\tilde{r}_t\) is the model’s predicted return. Assuming predictions are unbiased \((\mathbb{E}[r_t]=\mathbb{E}[\tilde{r}_t])\), then (after some math…)

\[\text{MSE} = \mathbb{E}[(r_t - \tilde{r}_t)^2]=\mathbb{V}[r_t]+\mathbb{V}[\tilde{r}_t]-2\text{Cov}(r_t, \tilde{r}_t)\]
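
A quick Monte Carlo check of this identity, with simulated returns and a made-up correlated, unbiased prediction:

```r
set.seed(42)
n <- 1e5
r       <- rnorm(n, sd = 0.02)              # "realized" returns
r_tilde <- 0.6 * r + rnorm(n, sd = 0.01)    # correlated, unbiased prediction
mse    <- mean((r - r_tilde)^2)
decomp <- var(r) + var(r_tilde) - 2 * cov(r, r_tilde)  # the three terms above
```

Note that \(\mathbb{V}[r_t]\) is fixed by the market, so a model mostly adds value through the covariance term.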

Question: which of these terms matter the most?

5 Advanced temporal models

5.1 (Feedforward) neural networks

A few images help explain how simple NNs work.

First, the structure. The network takes an input which is then processed via many multiplications, additions and activations.
At the very end, we get the output. It can be a simple number, but could also be more complex (vector, matrix, etc.).

The intermediate activation functions are there to break the linearity. A few examples below:

Next, the training. This phase is intended to determine “good” parameter values. Obviously, the “best” parameters would be ideal, but they are out of reach (the problem is too complex).

The core idea is gradient descent: we seek to minimize a loss function (e.g., the MSE) and this depends on the derivative:
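
A one-parameter sketch of gradient descent on an MSE loss (a linear “network” with a single weight; all values below are illustrative):

```r
set.seed(42)
x <- runif(200)
y <- 3 * x + rnorm(200, sd = 0.1)        # data generated with true weight 3
w  <- 0                                  # initial weight
lr <- 0.5                                # learning rate
for (step in 1:200) {
  grad <- mean(2 * (w * x - y) * x)      # derivative of mean((w*x - y)^2) w.r.t. w
  w <- w - lr * grad                     # move against the gradient
}
```

After the loop, `w` is close to the data-generating value of 3: the loss decreases at each step until a (here, global) minimum is reached.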

Now, there is a very interesting process called backpropagation that involves lots of computations of derivatives (skipped here):

In the end, the parameters (weights) of the model are locally optimal.
But they depend on the training sample: whether they work well out-of-sample (on testing sets) is another question.
If yes, the model is said to generalize well. That’s the great quest of machine learning.

5.2 Recurrent networks